Memory consistency directed cache coherence protocols for scalable multiprocessors

Abstract

The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model.

Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability, in terms of per-core storage due to sharer tracking, which poses a problem with increasing number of cores in today’s CMPs, most of which no longer are sequentially consistent. The recent convergence towards programming language based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either: adapting the protocol to satisfy the stricter consistency model; or changing the architecture’s consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach satisfying backward compatibility.

Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers and instead relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization) using per cache line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales logarithmically with increasing core count.

Next, we propose an approach for the x86-64 architecture, which is a compromise between retaining the original consistency model and using a more storage efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSO-CC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3’s storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC.

Finally, it is imperative that hardware adheres to the promised memory consistency model. Indeed, consistency directed coherence protocols cannot use conventional coherence definitions (e.g. SWMR) to be verified against, and few existing verification methodologies apply. Furthermore, as the full consistency model is used as a specification, their interaction with other components (e.g. pipeline) of a system must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when used for simulation-based memory consistency verification would be too slow.

We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator’s reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the Gem5 cycle accurate simulator in full-system mode with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full-system. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests).
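
The abstract mentions tests whose executions are checked for adherence to the consistency model, with litmus tests as one baseline. As a rough illustration of what such a check involves, the sketch below enumerates the outcomes of the classic store-buffering litmus test under a toy SC model and a toy TSO model with per-thread FIFO store buffers. It is not McVerSi and is not taken from the thesis; `SB_TEST`, `outcomes` and `is_allowed` are hypothetical names chosen for this example, and the operational models are deliberately minimal.

```python
# Store-buffering (SB) litmus test; x and y start at 0:
#   Thread 0: x = 1; r0 = y
#   Thread 1: y = 1; r1 = x
SB_TEST = (
    (("store", "x", 1), ("load", "y", "r0")),
    (("store", "y", 1), ("load", "x", "r1")),
)


def outcomes(threads, tso):
    """Return every reachable final register state as a set of sorted tuples.

    tso=False: stores update memory immediately (toy SC model).
    tso=True:  stores enter a per-thread FIFO buffer and drain later (toy TSO model).
    """
    results = set()

    def replace(tup, i, value):
        # Return a copy of tup with element i replaced by value.
        return tup[:i] + (value,) + tup[i + 1:]

    def explore(pcs, bufs, mem, regs):
        # All threads finished: registers are final, record the outcome.
        if all(pc == len(prog) for pc, prog in zip(pcs, threads)):
            results.add(tuple(sorted(regs.items())))
            return
        for tid, prog in enumerate(threads):
            # Choice 1: drain the oldest buffered store of this thread to memory.
            if bufs[tid]:
                addr, val = bufs[tid][0]
                explore(pcs, replace(bufs, tid, bufs[tid][1:]), {**mem, addr: val}, regs)
            # Choice 2: execute this thread's next instruction, if any.
            if pcs[tid] == len(prog):
                continue
            op, addr, arg = prog[pcs[tid]]
            next_pcs = replace(pcs, tid, pcs[tid] + 1)
            if op == "store" and tso:
                explore(next_pcs, replace(bufs, tid, bufs[tid] + ((addr, arg),)), mem, regs)
            elif op == "store":
                explore(next_pcs, bufs, {**mem, addr: arg}, regs)
            else:
                # Loads forward from the newest matching entry in the own buffer,
                # otherwise they read memory (initial value 0).
                val = mem.get(addr, 0)
                for buf_addr, buf_val in bufs[tid]:
                    if buf_addr == addr:
                        val = buf_val
                explore(next_pcs, bufs, mem, {**regs, arg: val})

    n = len(threads)
    explore((0,) * n, ((),) * n, {}, {})
    return results


def is_allowed(observed, threads, tso):
    """Check whether an observed register state is reachable in the chosen model."""
    return tuple(sorted(observed.items())) in outcomes(threads, tso)


weak = {"r0": 0, "r1": 0}
print(is_allowed(weak, SB_TEST, tso=False))  # False: SC forbids both loads reading 0
print(is_allowed(weak, SB_TEST, tso=True))   # True: TSO store buffering permits it
```

Exhaustive enumeration like this only works for tiny tests. The abstract's point is that larger, generated tests running on a full-system simulator are needed to reach bugs that arise from the interaction of the pipeline and the coherence protocol.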

Bibliographic record

  • Author

    Elver, Marco Iskender

  • Affiliation
  • Year 2016
  • Total pages
  • Original format PDF
  • Language en
  • CLC classification
